Problem definition: Given 2013 sales data for 1559 products across 10
stores in different cities, the aim of this data exploration is to identify the
attributes that have a statistically significant impact on the
sales of each product at a particular store.
========================================================
Explore the data a bit
## [1] "Item_Identifier" "Item_Weight"
## [3] "Item_Fat_Content" "Item_Visibility"
## [5] "Item_Type" "Item_MRP"
## [7] "Outlet_Identifier" "Outlet_Establishment_Year"
## [9] "Outlet_Size" "Outlet_Location_Type"
## [11] "Outlet_Type" "Item_Outlet_Sales"
Dimension of data
## [1] 8523 12
There are 8523 observations in total and 12 attributes for each
observation.
Structure of data
## 'data.frame': 8523 obs. of 12 variables:
## $ Item_Identifier : Factor w/ 1559 levels "DRA12","DRA24",..: 157 9 663 1122 1298 759 697 739 441 991 ...
## $ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
## $ Item_Fat_Content : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
## $ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
## $ Item_Type : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
## $ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
## $ Outlet_Identifier : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
## $ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
## $ Outlet_Size : Factor w/ 4 levels "","High","Medium",..: 3 3 3 1 2 3 2 3 1 1 ...
## $ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
## $ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
## $ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
Find any entries that are blank and set them to NA. Item_Weight has missing values.
Outlet_Size has entries that are empty and Item_Visibility has entries that
are zero. An Item_Visibility of zero would mean the product is not on display in the outlet,
which does not make sense, so these entries require imputation.
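The recoding described above can be sketched in base R as follows. This is a minimal illustration on a made-up data frame called `raw` (a hypothetical name), not the exact code used for this report.

```r
# Toy data frame standing in for the raw training data (values are made up).
raw <- data.frame(
  Item_Weight     = c(9.3, NA, 17.5, 19.2),
  Item_Visibility = c(0.016, 0.019, 0, 0),
  Outlet_Size     = factor(c("Medium", "", "High", ""))
)

# Blank factor entries become NA (the empty level itself stays in levels()).
raw$Outlet_Size[raw$Outlet_Size == ""] <- NA

# A visibility of exactly zero is treated as missing.
raw$Item_Visibility[raw$Item_Visibility == 0] <- NA

# Number of missing values per column.
colSums(is.na(raw))
```

Because the empty level is not dropped, `summary()` would still list it with a count of 0, which matches the summary output further down.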
Looking at the missing data pattern visually
The following figure is useful for investigating any structure in the missing
observations. The missing pattern could also suggest which variables
could be useful for imputing the missing entries.
md.pattern(bigmarts_train)
## Item_Identifier Item_Fat_Content Item_Type Item_MRP Outlet_Identifier
## 4358 1 1 1 1 1
## 1373 1 1 1 1 1
## 292 1 1 1 1 1
## 2266 1 1 1 1 1
## 90 1 1 1 1 1
## 144 1 1 1 1 1
## 0 0 0 0 0
## Outlet_Establishment_Year Outlet_Location_Type Outlet_Type
## 4358 1 1 1
## 1373 1 1 1
## 292 1 1 1
## 2266 1 1 1
## 90 1 1 1
## 144 1 1 1
## 0 0 0
## Item_Outlet_Sales Item_Visibility Item_Weight Outlet_Size
## 4358 1 1 1 1 0
## 1373 1 1 0 1 1
## 292 1 0 1 1 1
## 2266 1 1 1 0 1
## 90 1 0 0 1 2
## 144 1 0 1 0 2
## 0 526 1463 2410 4399
##
## Variables sorted by number of missings:
## Variable Count
## Outlet_Size 0.28276428
## Item_Weight 0.17165317
## Item_Visibility 0.06171536
## Item_Identifier 0.00000000
## Item_Fat_Content 0.00000000
## Item_Type 0.00000000
## Item_MRP 0.00000000
## Outlet_Identifier 0.00000000
## Outlet_Establishment_Year 0.00000000
## Outlet_Location_Type 0.00000000
## Outlet_Type 0.00000000
## Item_Outlet_Sales 0.00000000
## [1] 526
On the left side, the histogram shows the variables sorted by the number of
missing values.
The variables with the most missing values are Outlet_Size, Item_Weight and Item_Visibility.
On the right side of the graph, blue represents the observed data and
red represents the missing data.
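The "Variables sorted by number of missings" column can also be computed without any plotting package. The sketch below uses a small made-up data frame (`toy` is a hypothetical name) and plain base R.

```r
# Toy data with some missing entries (values are made up).
toy <- data.frame(
  Item_Weight     = c(9.3, NA, 17.5, NA),
  Item_Visibility = c(0.016, 0.019, NA, 0.05),
  Item_MRP        = c(249.8, 48.3, 141.6, 182.1)
)

# Proportion of missing values per variable, largest first.
miss_prop <- sort(colMeans(is.na(toy)), decreasing = TRUE)
miss_prop
```

On the real data the same call should reproduce the proportions listed above, for example 2410 / 8523 for Outlet_Size.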
Cleaning the data: Using statistical imputation techniques
Imputation is the process of replacing missing data with
substituted values. The method used here is predictive mean matching (pmm). Some values of
Item_Visibility are zero, which is not correct.
Item_Weight and Outlet_Size have missing values, so they need to be fixed with imputation.
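As an illustration of predictive mean matching, here is a small sketch with the mice package on simulated data. I am assuming mice is installed (the report already calls its `md.pattern()`). The data, the seed and the settings `m = 1` and `maxit = 5` are made up for the example and are not necessarily the ones used for this report.

```r
library(mice)  # assumption: the mice package is installed

set.seed(1)
n <- 200
toy <- data.frame(Item_MRP = runif(n, 30, 270))
toy$Item_Weight <- 5 + 0.04 * toy$Item_MRP + rnorm(n)

# Knock out 30 weights to imitate the missing entries.
miss <- sample(n, 30)
observed <- toy$Item_Weight[-miss]
toy$Item_Weight[miss] <- NA

# One imputed data set via predictive mean matching.
imp <- mice(toy, m = 1, method = "pmm", maxit = 5, seed = 123,
            printFlag = FALSE)
toy_imp <- complete(imp, 1)

# PMM borrows values from observed "donor" rows, so every imputed
# weight is a value that already occurs in the observed data.
all(toy_imp$Item_Weight[miss] %in% observed)
```

Because imputed values are copied from donors, pmm never produces impossible values such as a negative weight, which is one reason it is a common default for numeric columns.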
Display first 5 records
## Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1 FDA15 9.30 Low Fat 0.01604730
## 2 DRC01 5.92 Regular 0.01927822
## 3 FDN15 17.50 Low Fat 0.01676007
## 4 FDX07 19.20 Regular NA
## 5 NCD19 8.93 Low Fat NA
## Item_Type Item_MRP Outlet_Identifier
## 1 Dairy 249.8092 OUT049
## 2 Soft Drinks 48.2692 OUT018
## 3 Meat 141.6180 OUT049
## 4 Fruits and Vegetables 182.0950 OUT010
## 5 Household 53.8614 OUT013
## Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## 1 1999 Medium Tier 1
## 2 2009 Medium Tier 3
## 3 1999 Medium Tier 1
## 4 1998 <NA> Tier 3
## 5 1987 High Tier 3
## Outlet_Type Item_Outlet_Sales
## 1 Supermarket Type1 3735.1380
## 2 Supermarket Type2 443.4228
## 3 Supermarket Type1 2097.2700
## 4 Grocery Store 732.3800
## 5 Supermarket Type1 994.7052
The data set is divided into two types of variables:
Categorical variables: Item_Identifier, Item_Fat_Content, Item_Type, Outlet_Identifier, Outlet_Size, Outlet_Location_Type, Outlet_Type
Numerical variables: Item_Weight, Item_Visibility, Item_MRP,
Outlet_Establishment_Year, Item_Outlet_Sales.
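The split into categorical and numerical variables can be read off the column classes. A minimal sketch on a made-up data frame (`toy` is a hypothetical name), assuming the categorical columns are stored as factors as in the `str()` output above:

```r
# Toy data frame with two factor columns and two numeric columns.
toy <- data.frame(
  Item_Type         = factor(c("Dairy", "Meat", "Dairy")),
  Outlet_Size       = factor(c("Medium", "High", "Medium")),
  Item_MRP          = c(249.8, 141.6, 53.9),
  Item_Outlet_Sales = c(3735.1, 2097.3, 994.7)
)

# TRUE for factor columns, FALSE otherwise.
is_cat <- vapply(toy, is.factor, logical(1))

categorical_vars <- names(toy)[is_cat]
numerical_vars   <- names(toy)[!is_cat]
```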
Let's look at the summary of the data
## Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## FDG33 : 10 Min. : 4.555 LF : 316 Min. :0.0036
## FDW13 : 10 1st Qu.: 8.774 low fat: 112 1st Qu.:0.0314
## DRE49 : 9 Median :12.600 Low Fat:5089 Median :0.0578
## DRN47 : 9 Mean :12.858 reg : 117 Mean :0.0705
## FDD38 : 9 3rd Qu.:16.850 Regular:2889 3rd Qu.:0.0981
## FDF52 : 9 Max. :21.350 Max. :0.3284
## (Other):8467 NA's :1463 NA's :526
## Item_Type Item_MRP Outlet_Identifier
## Fruits and Vegetables:1232 Min. : 31.29 OUT027 : 935
## Snack Foods :1200 1st Qu.: 93.83 OUT013 : 932
## Household : 910 Median :143.01 OUT035 : 930
## Frozen Foods : 856 Mean :140.99 OUT046 : 930
## Dairy : 682 3rd Qu.:185.64 OUT049 : 930
## Canned : 649 Max. :266.89 OUT045 : 929
## (Other) :2994 (Other):2937
## Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## Min. :1985 : 0 Tier 1:2388
## 1st Qu.:1987 High : 932 Tier 2:2785
## Median :1999 Medium:2793 Tier 3:3350
## Mean :1998 Small :2388
## 3rd Qu.:2004 NA's :2410
## Max. :2009
##
## Outlet_Type Item_Outlet_Sales
## Grocery Store :1083 Min. : 33.29
## Supermarket Type1:5577 1st Qu.: 834.25
## Supermarket Type2: 928 Median : 1794.33
## Supermarket Type3: 935 Mean : 2181.29
## 3rd Qu.: 3101.30
## Max. :13086.97
##
Examining the output of the summary function, we see that the results are presented differently
for continuous and categorical variables.
Fat content has only two real types, low fat and regular,
but the data set splits them into five categories: Low Fat, low fat, LF, reg and
Regular. Let's fix this problem.
# Changing reg to Regular, and LF and low fat to Low Fat
fat <- bigmarts_train_imp$Item_Fat_Content
fat <- replace(fat, fat == "reg", "Regular")
fat <- replace(fat, fat %in% c("LF", "low fat"), "Low Fat")
bigmarts_train_imp$Item_Fat_Content <- fat
The density plot of Item_MRP shows several distinct levels
in the MRP distribution, so Item_MRP is divided into 4 categories: Low, Medium,
High and Very_High.
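A binning like this is usually done with `cut()`. The sketch below uses the first five Item_MRP values from the data. The cut points in `mrp_breaks` are hypothetical placeholders, since the values actually read off the density plot are not stated here.

```r
# Hypothetical cut points; the real ones come from the density plot.
mrp_breaks <- c(0, 70, 135, 200, Inf)
mrp_labels <- c("Low", "Medium", "High", "Very_High")

# First five Item_MRP values from the data set.
item_mrp <- c(249.8092, 48.2692, 141.6180, 182.0950, 53.8614)

# Each price falls into exactly one right-closed interval.
mrp_level <- cut(item_mrp, breaks = mrp_breaks, labels = mrp_labels)
table(mrp_level)
```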
The density plots of Item_Visibility are almost the same before and
after imputation.
Univariate Plots Section: The analysis is divided into two levels, product level and outlet level
The long-tailed distribution of sales shows that a small number of items have higher
sales: approximately 20% of items account for 80% of sales.
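The 20% / 80% statement can be checked numerically by summing sales per item and looking at the share held by the top fifth of items. The sketch below runs on made-up numbers (`sales` is a hypothetical data frame), so the share it prints says nothing about the real data.

```r
# Made-up sales for ten items, two observations per item.
sales <- data.frame(
  Item_Identifier   = rep(paste0("ITEM", 1:10), times = 2),
  Item_Outlet_Sales = c(5000, 3000, 200, 150, 100, 100, 90, 80, 70, 60,
                        4500, 2500, 180, 120, 110,  90, 80, 70, 60, 50)
)

# Total sales per item, largest first.
per_item <- sort(tapply(sales$Item_Outlet_Sales, sales$Item_Identifier, sum),
                 decreasing = TRUE)

# Share of total sales contributed by the top 20% of items.
top_n <- ceiling(0.2 * length(per_item))
top_share <- sum(per_item[seq_len(top_n)]) / sum(per_item)
top_share
```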
Let us move forward to look at the distribution of sales across the categorical
variables.
The plot shows that OUT010 and OUT019 have relatively low sales.
Low fat items have higher sales than regular fat content
items.
It is apparent that medium-sized outlets have higher sales.
Tier 3 has more sales.
Supermarket Type1 and Type3 have high sales and are reasonably well
represented in the data.
Grocery stores are higher in number but do not have many sales.
It is important to understand how long each outlet had been established
as of 2013, since the data is for the year 2013.
bigmarts_train_imp$Total_Year_Establishment <- as.factor(2013-bigmarts_train_imp$Outlet_Establishment_Year)
bigmarts_train_imp$I_O_SaleP <- bigmarts_train_imp$Item_Outlet_Sales * 100 / max(bigmarts_train_imp$Item_Outlet_Sales)
Arrange the data so that Item_Outlet_Sales becomes the last column.
Bigmarts <- subset(bigmarts_train_imp, select = c(Item_Identifier,
Item_Weight,
Item_Fat_Content,
Item_Visibility,
Item_Type,
Item_MRP,
MRP_Level,
Outlet_Identifier,
Outlet_Size,
Outlet_Location_Type,
Outlet_Type,
Total_Year_Establishment,
Item_Outlet_Sales))
Tier 3 locations have good sales for older established outlets.
Tier 2 and Tier 3 locations have good sales for middle-aged outlets.
| Variable | Description |
|---|---|
| Item_Identifier | Unique product ID |
| Item_Weight | Weight of product |
| Item_Fat_Content | Whether the product is low fat or not |
| Item_Visibility | % of total display area in store allocated to this product |
| Item_Type | Category to which product belongs |
| Item_MRP | Maximum Retail Price (list price) of product |
| MRP_Level | Category of MRP: Low, Medium, High, Very_High |
| Outlet_Identifier | Unique store ID |
| Total_Year_Establishment | Number of years the store had been established as of 2013 |
| Outlet_Size | Size of the store |
| Outlet_Location_Type | Type of city in which store is located |
| Outlet_Type | Grocery store or some sort of supermarket |
| Item_Outlet_Sales | Sales of product in particular store. This is the target variable which is to be predicted. |
There are 8523 observations with 13 features. Observations:
* There are 1559 unique items and 10 different outlets.
* According to the density plot of Item_MRP, the data falls into 4 groups: Low, Medium, High and Very_High.
* The number of low fat products is larger than the number of regular products.
* Most of the outlets have High and Medium MRP level products.
* The most popular products in the outlets are Baking Goods, Snack Foods, Household, Fruits and Vegetables and Frozen Foods.
* Most of the outlets are of medium size.
* Most of the outlet locations are Tier 3.
* OUT035, OUT045, OUT046, OUT027, OUT017, OUT018 and OUT013 appear most often in the data.
The main features in the data set up to this point of the analysis are the Item_MRP level, Item_Weight,
Item_Fat_Content and Item_Outlet_Sales.
I would like to determine which features, and which combinations of features, are
best for determining item sales in an outlet.
I hope to discover more significant features and combinations through
bivariate and multivariate analysis.
### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
The levels of the categorical variables are likely to contribute to the sales
of a product in an outlet, and the number of years since the outlet was established could also
play a role. The assumption is that the older the store, the more likely
it is to have higher sales.
The following variables were created:
MRP_Level: The density plot of Item_MRP showed four different distributions
of MRP, so I created 4 levels for the price of the product: Low, Medium, High and
Very_High.
Total_Year_Establishment: The establishment year of the outlet subtracted
from 2013.
526 out of 8523 entries of Item_Visibility are zero, which makes
no sense for a product in an outlet.
Products with zero visibility still have sales.
This may be a data entry mistake, so I imputed all the zero
entries.
# Bivariate Plots Section
Bivariate plots based on outlet type
From the above distribution of each outlet type, Supermarket Type3
has the highest sales, at 28 years of establishment.
Based on the color scheme, Supermarket Type1 has good sales at
9 and 14 years of establishment.
Grocery stores have lower product sales.
It is clear that most of the sales are from Tier 3 outlet locations.
Medium-sized outlets have higher sales, as shown in the above distribution graph.
It is clear that OUT027 has much higher sales compared to the other outlets.
Bivariate plots based on product type
The sales of low fat and regular items are almost the same for medium-sized
outlets.
The above graph clearly depicts a linear relationship between Item_MRP
and Item_Outlet_Sales.
Let us gain more insight by adding a color scheme and other dimensions, with more
transformations.
## Warning: Removed 132 rows containing non-finite values (stat_smooth).
## Warning: Removed 132 rows containing missing values (geom_point).
I looked at the categorical variables again against Item_Outlet_Sales and Item_MRP.
Applying a log transform to Item_Outlet_Sales and a cube-root transform to
Item_MRP produces a more linear trend.
After these transformations, the relationship between Item_MRP and
Item_Outlet_Sales appears almost linear.
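To show what the two transformations do together, here is a small sketch on simulated data. The prices and sales are generated inside the example (they are not the Bigmarts data), with the sales built so that the log of sales is linear in the cube root of MRP.

```r
set.seed(42)

# Simulated prices and sales; coefficients and noise level are made up.
n <- 500
item_mrp   <- runif(n, 31, 267)
item_sales <- exp(4 + 0.64 * item_mrp^(1/3) + rnorm(n, sd = 0.5))
toy <- data.frame(Item_MRP = item_mrp, Item_Outlet_Sales = item_sales)

# Linear model on the transformed scale: log(sales) ~ cube root of MRP.
fit <- lm(log(Item_Outlet_Sales) ~ I(Item_MRP^(1/3)), data = toy)
coef(fit)
summary(fit)$r.squared
```

The `I()` wrapper is needed so that `^` is read as arithmetic rather than as formula syntax.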
The graph of Item_MRP (log10) vs Item_Outlet_Sales (log10) colored by MRP_Level shows an almost
perfectly linear relation between the two.
With the color scheme added, it is apparent that products with MRP_Level High and
Very_High have high sales values.
It is also important to understand that a high correlation does not necessarily
mean that there is an underlying fundamental relationship between the two
variables; it can, however, offer some insight for further investigation.
No apparent relationship between Item_Weight and sales shows in
the above graph.
Items with visibility less than 0.2 have high sales. Let's take a closer look
by breaking this down into store types.
In Supermarket Type1, Type2 and Type3, items with visibility less than 0.2 have more sales compared
to grocery stores.
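The comparison by store type can also be put into numbers by flagging items below the 0.2 visibility threshold and averaging sales within each group. The sketch below uses a made-up data frame (`toy` and the `Low_Visibility` column are hypothetical names), so the means it prints are illustrative only.

```r
# Made-up observations for two outlet types.
toy <- data.frame(
  Outlet_Type = c("Grocery Store", "Grocery Store", "Supermarket Type1",
                  "Supermarket Type1", "Supermarket Type1", "Grocery Store"),
  Item_Visibility   = c(0.25, 0.10, 0.05, 0.22, 0.12, 0.30),
  Item_Outlet_Sales = c(200, 400, 3000, 1500, 2500, 150)
)

# Flag items below the 0.2 visibility threshold discussed above.
toy$Low_Visibility <- toy$Item_Visibility < 0.2

# Mean sales for each outlet type / visibility group.
mean_sales <- aggregate(Item_Outlet_Sales ~ Outlet_Type + Low_Visibility,
                        data = toy, FUN = mean)
mean_sales
```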
No clear relationship stands out between
Total_Year_Establishment and Item_Outlet_Sales from the vertical strips in the
graph, but there may be some kind of nonlinear relationship.
The graph does suggest, although it may not hold in general, that outlets which
are older have higher sales.
# Bivariate Analysis
Item_MRP is strongly correlated with Item_Outlet_Sales.
MRP level Very_High is also positively correlated with Item_Outlet_Sales.
Outlet identifier is also positively correlated with sales.
No apparent relationship shows between
Total_Year_Establishment and Item_Outlet_Sales.
Item_Visibility and Item_Outlet_Sales are slightly negatively correlated.
To make this clearer, I added a color scheme to show which items these are.
It is clear that Baking Goods, Breads, Breakfast items and Dairy products occupy
less space but still have more sales. It makes sense that items which take up less space in an
outlet do not necessarily have lower sales.
It is unreliable to judge sales from the visibility percentage alone.
However, items with visibility less than 0.2 were more often high in sales.
Item_Outlet_Sales is strongly correlated with Item_MRP.
Certain outlets are positively correlated with sales.
This can be measured by converting the categorical variables into dummy variables.
# Multivariate Plots Section
The graph shows that low fat items have high outlet sales.
Items tagged with the Low MRP level have very low sales.
It shows that OUT027 has much higher sales compared to the other outlets.
[Figure: Item Visibility vs Outlet Sales]
We see that there is a difference between grocery stores and supermarkets, which
is shown in the above scatter plot.
Supermarkets have higher sales. The graph above suggests that Item_Visibility and Item_Outlet_Sales are slightly negatively correlated.
To make this clearer, I added a color scheme to show which items these are.
It is clear that Baking Goods, Breads, Breakfast items and Dairy products occupy
less space but still have more sales.
It also shows that items with visibility less than 0.2 have high sales.
## [1] 8523 1618
## i j cor p
## 1 Item_Identifier.DRA12 Item_Identifier.DRA24 -0.0007609629 0.9440011
## 2 Item_Identifier.DRA12 Item_Identifier.DRA59 -0.0008135513 0.9401382
## 3 Item_Identifier.DRA24 Item_Identifier.DRA59 -0.0008787875 0.9353482
## 4 Item_Identifier.DRA12 Item_Identifier.DRB01 -0.0004980502 0.9633315
## 5 Item_Identifier.DRA24 Item_Identifier.DRB01 -0.0005379873 0.9603935
## i j cor p
## 1223829 Item_Fat_Content.Low.Fat Item_Fat_Content.Regular -1 0
## 1290406 Outlet_Identifier.OUT018 Outlet_Type.Supermarket.Type2 1 0
## 1292014 Outlet_Identifier.OUT027 Outlet_Type.Supermarket.Type3 1 0
## 1293619 Outlet_Identifier.OUT018 Total_Year_Establishment.4 1 0
## 1293635 Outlet_Type.Supermarket.Type2 Total_Year_Establishment.4 1 0
## 1295226 Outlet_Identifier.OUT017 Total_Year_Establishment.6 1 0
## 1296839 Outlet_Identifier.OUT035 Total_Year_Establishment.9 1 0
## 1298450 Outlet_Identifier.OUT045 Total_Year_Establishment.11 1 0
## 1300063 Outlet_Identifier.OUT049 Total_Year_Establishment.14 1 0
## 1301666 Outlet_Identifier.OUT010 Total_Year_Establishment.15 1 0
Item_Visibility and Item_Weight also do not appear to be correlated.
# Multivariate Analysis
The following shows the importance of the variables that contribute to sales.
I used the dummyVars function to translate all factor variables into numerical variables (these are also used for modeling). After applying dummyVars, 1618
feature columns are generated, of which the 10 most important
features are shown.
# Dimension of newly created features.
dim(BigmartsTrsf)
## [1] 8523 1618
# top 10 best features
selectedSub
## i j cor p
## 1308119 Item_MRP Item_Outlet_Sales 0.5675744 0
## 1308141 Outlet_Type.Grocery.Store Item_Outlet_Sales -0.4117271 0
## 1308123 MRP_Level.Very_High Item_Outlet_Sales 0.3944153 0
## 1308121 MRP_Level.Low Item_Outlet_Sales -0.3658668 0
## 1308129 Outlet_Identifier.OUT027 Item_Outlet_Sales 0.3111920 0
## 1308144 Outlet_Type.Supermarket.Type3 Item_Outlet_Sales 0.3111920 0
## 1308124 Outlet_Identifier.OUT010 Item_Outlet_Sales -0.2848826 0
## 1308150 Total_Year_Establishment.15 Item_Outlet_Sales -0.2848826 0
## 1308128 Outlet_Identifier.OUT019 Item_Outlet_Sales -0.2772498 0
## 1308122 MRP_Level.Medium Item_Outlet_Sales -0.2288469 0
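The ranking above came from dummyVars followed by a correlation table. A rough base R equivalent is sketched below: `model.matrix()` with full contrasts is swapped in for dummyVars, and the data frame `toy` is made up, so the column names and numbers differ from the output above.

```r
# Made-up data: one numeric feature, one factor and the target.
toy <- data.frame(
  Item_MRP = c(249.8, 48.3, 141.6, 182.1, 53.9, 120.0),
  Outlet_Type = factor(c("Supermarket", "Grocery", "Supermarket",
                         "Grocery", "Supermarket", "Grocery")),
  Item_Outlet_Sales = c(3735, 120, 2097, 732, 995, 300)
)

# One indicator column per factor level (no reference level dropped).
factor_cols    <- names(toy)[vapply(toy, is.factor, logical(1))]
full_contrasts <- lapply(toy[factor_cols], contrasts, contrasts = FALSE)
X <- model.matrix(~ . - 1, data = toy, contrasts.arg = full_contrasts)

# Correlation of every feature with the target, ranked by absolute value.
target <- "Item_Outlet_Sales"
cors   <- cor(X[, colnames(X) != target], X[, target])[, 1]
ranked <- cors[order(abs(cors), decreasing = TRUE)]
ranked
```

With a two-level factor the two indicator columns are exact complements, so their correlations with sales have the same size and opposite sign. The same thing explains the pairs of identical values in the table above, such as OUT027 and Supermarket Type3.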
Let us make the visualization clearer using facet wrap in the following
graph.
ggplot(Bigmarts, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type)) +
scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+
scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+
theme_bw() + labs(title="Scatterplot of Item_MRP and Item_Visibility with Item Type")+ facet_wrap( ~ Item_Type)
This shows that items with visibility less than 0.2 have high MRP, which is interesting
because we have already seen that Item_MRP is positively correlated with sales and that
items with visibility less than 0.2 have high sales.
Bigmarts_corr <- subset(Bigmarts, select = c(Item_Weight,Item_Visibility, Item_MRP, Item_Outlet_Sales))
scatterplotMatrix(Bigmarts_corr[1:4], main = "Scatter plot of numerical variables")
The scatter plot above shows the correlation between the highly correlated pairs. The diagonal cells show the kernel density plot of each variable. The pairs plot, and in particular the last Item_Outlet_Sales column, tells us a lot about our
data set. On examination, there is a strong correlation between
Item_MRP and Item_Outlet_Sales. MRP level Very_High is also positively correlated with Item_Outlet_Sales.
Outlet identifier is also positively correlated with sales.
Plotting Item_MRP with different facets revealed a valuable relationship with
Item_Outlet_Sales.
I was able to find the important numerical and categorical features that play a
significant role in sales.
The most correlated features are given below.
## [1] "Item_MRP" "Outlet_Type.Grocery.Store"
## [3] "MRP_Level.Very_High" "MRP_Level.Low"
## [5] "Outlet_Identifier.OUT027" "Outlet_Type.Supermarket.Type3"
## [7] "Outlet_Identifier.OUT010" "Total_Year_Establishment.15"
## [9] "Outlet_Identifier.OUT019" "MRP_Level.Medium"
I observed that no correlation appeared between
Total_Year_Establishment and Item_Outlet_Sales, even after certain
transformations of the variables.
This is strange, because there are some older outlets which have reasonable sales.
One other interesting finding is that items with visibility less than 0.2 have high
MRP, which is interesting because we have already seen that Item_MRP is positively
correlated with sales and that items with visibility less than 0.2 have high sales.
##
## Calls:
## m1: lm(formula = I(log(Item_Outlet_Sales)) ~ I(Item_MRP^(1/3)), data = Bigmarts)
## m2: lm(formula = I(log(Item_Outlet_Sales)) ~ I(Item_MRP^(1/3)) +
## Total_Year_Establishment, data = Bigmarts)
## m3: lm(formula = I(log(Item_Outlet_Sales)) ~ I(Item_MRP^(1/3)) +
## Total_Year_Establishment + MRP_Level, data = Bigmarts)
## m4: lm(formula = I(log(Item_Outlet_Sales)) ~ I(Item_MRP^(1/3)) +
## Total_Year_Establishment + MRP_Level + Outlet_Identifier,
## data = Bigmarts)
## m5: lm(formula = I(log(Item_Outlet_Sales)) ~ I(Item_MRP^(1/3)) +
## Total_Year_Establishment + MRP_Level + Outlet_Identifier +
## Outlet_Location_Type, data = Bigmarts)
##
## ===========================================================================================
## m1 m2 m3 m4 m5
## -------------------------------------------------------------------------------------------
## (Intercept) 4.039*** 4.079*** 3.966*** 4.077*** 4.077***
## (0.058) (0.053) (0.204) (0.148) (0.148)
## I(Item_MRP^(1/3)) 0.642*** 0.640*** 0.664*** 0.644*** 0.644***
## (0.011) (0.009) (0.036) (0.026) (0.026)
## Total_Year_Establishment: 6 0.205*** 0.205*** 0.205*** 0.205***
## (0.033) (0.033) (0.024) (0.024)
## Total_Year_Establishment: 9 0.226*** 0.224*** 0.224*** 0.224***
## (0.033) (0.033) (0.024) (0.024)
## Total_Year_Establishment: 11 0.131*** 0.131*** 0.131*** 0.131***
## (0.033) (0.033) (0.024) (0.024)
## Total_Year_Establishment: 14 0.207*** 0.206*** 0.206*** 0.206***
## (0.033) (0.033) (0.024) (0.024)
## Total_Year_Establishment: 15 -1.788*** -1.788*** -1.788*** -1.788***
## (0.038) (0.038) (0.028) (0.028)
## Total_Year_Establishment: 16 0.166*** 0.166*** 0.166*** 0.166***
## (0.033) (0.033) (0.024) (0.024)
## Total_Year_Establishment: 26 0.146*** 0.146*** 0.145*** 0.145***
## (0.033) (0.033) (0.024) (0.024)
## Total_Year_Establishment: 28 -0.183*** -0.183*** 0.708*** 0.708***
## (0.030) (0.030) (0.024) (0.024)
## MRP_Level: Low -0.033 -0.075 -0.075
## (0.073) (0.053) (0.053)
## MRP_Level: Medium 0.040 0.024 0.024
## (0.036) (0.026) (0.026)
## MRP_Level: Very_High -0.106** -0.091*** -0.091***
## (0.033) (0.024) (0.024)
## Outlet_Identifier: OUT019/OUT010 -2.470*** -2.470***
## (0.028) (0.028)
## -------------------------------------------------------------------------------------------
## R-squared 0.3 0.5 0.5 0.7 0.7
## adj. R-squared 0.3 0.5 0.5 0.7 0.7
## sigma 0.9 0.7 0.7 0.5 0.5
## F 3282.0 966.1 730.0 1870.8 1870.8
## p 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -10849.2 -9238.6 -9221.6 -6483.9 -6483.9
## Deviance 6364.6 4361.5 4344.1 2285.1 2285.1
## AIC 21704.4 18499.2 18471.1 12997.8 12997.8
## BIC 21725.6 18576.7 18569.8 13103.5 13103.5
## N 8523 8523 8523 8523 8523
## ===========================================================================================
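The idea behind the table above is to add one group of predictors at a time and watch how the fit statistics move. Below is a minimal sketch of the same idea with two nested models on simulated data. The data frame `toy`, the effect sizes and the use of Outlet_Type as the added predictor are all assumptions made for the example.

```r
set.seed(7)

# Simulated data: grocery stores sell far less at the same price.
n <- 400
toy <- data.frame(
  Item_MRP    = runif(n, 31, 267),
  Outlet_Type = factor(sample(c("Grocery Store", "Supermarket Type1"),
                              n, replace = TRUE))
)
shift <- ifelse(toy$Outlet_Type == "Grocery Store", -1.8, 0)
toy$Item_Outlet_Sales <- exp(4 + 0.64 * toy$Item_MRP^(1/3) + shift +
                             rnorm(n, sd = 0.5))

# Nested models: price only, then price plus outlet type.
m1 <- lm(log(Item_Outlet_Sales) ~ I(Item_MRP^(1/3)), data = toy)
m2 <- update(m1, . ~ . + Outlet_Type)

# Compare fit: AIC, adjusted R-squared and an F test for the added term.
AIC(m1, m2)
c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
anova(m1, m2)
```

If an added predictor carries no new information, the statistics do not change at all, which is what happens between m4 and m5 in the table above.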
The distribution of item sales has a long tail, which reflects a real-life
phenomenon. The long-tailed distribution shows that a small number of items have higher
sales: approximately 20% of items account for 80% of sales.
The graph of the log10 transformation of sales against the cube root of MRP shows a linear
trend, so it should be possible to build a linear predictive model.
The scatter plot shows the correlation between the highly correlated pairs.
The diagonal cells show the kernel density plot of each variable.
The pairs plot, and in particular the last Item_Outlet_Sales column, tells us a
lot about our data set.
On examination, there is a strong correlation between
Item_MRP and Item_Outlet_Sales. MRP level Very_High is also positively correlated with Item_Outlet_Sales.
Outlet identifier is also positively correlated with sales. The 10 most correlated features are shown below.
It would also be possible to find the correlation between categorical
variables.
### Reflection
The Bigmarts data contains information about 1559 items and 10 outlets and their sales
for the year 2013.
It has 8523 observations of 12 variables. The data set is divided into two
types of variables:
Categorical variables: Item_Identifier, Item_Fat_Content, Item_Type, Outlet_Identifier, Outlet_Size,
Outlet_Location_Type, Outlet_Type.
Numerical variables: Item_Weight, Item_Visibility, Item_MRP, Outlet_Establishment_Year,
Item_Outlet_Sales.
I started by understanding the individual variables in the data set and plotting
them to find interesting patterns or trends.
The categorical variables held a lot of interesting information.
I explored Item_Outlet_Sales across many variables and found
interesting relations between them. It was clear that there is a positive
correlation between Item_MRP and Item_Outlet_Sales.
The one thing that surprised me a bit was that the density plots of
Item_Outlet_Sales and Item_Visibility are almost the same, which suggested a strong
positive correlation, yet no relationship stands out between the two.
It seemed really interesting to combine two categorical variables to predict
the sales (e.g. Item + Outlet -> Sales).
I built a linear model between Item_MRP and sales and added some
categorical variables, but I believe it could be improved a lot in terms of
accuracy and performance. I am planning to build a more accurate predictive
model in the future, when I am better versed in machine learning with R, as this was
the first time I worked with R. I was amazed by plotting graphs with R.